Goal In this assignment, you'll take a first-pass look at your newly adopted text collection, similar to Wolfram Alpha's view of a text.

Title, author, and other metadata. First, print out some summary information that gives some background on what this collection is and where it comes from:

Presidential Debates Text Collection

This collection contains transcripts of United States Presidential debates from 1960 to the present. The transcripts are taken from the Commission on Presidential Debates. The collection contains 39 text files with some HTML markup.


In [1]:
import nltk
import re
sent_tokenizer=nltk.data.load('tokenizers/punkt/english.pickle')

First, load in the file or files below and take a look at your text. An easy way to get started is to read the text in and then run it through the sentence tokenizer to divide it up, even if this division is not fully accurate. You may have to do a bit of work to figure out what the "opening phrase" that Wolfram Alpha shows should be. Below, write the code to read in the text, split it into sentences, and print out the opening phrase.


In [2]:
# I created a module to make it easier to load a text corpus. 
import corpii

debates = nltk.clean_html(corpii.load_pres_debates().raw())
sents = sent_tokenizer.sentences_from_text(debates)
print sents[0]


October 15, 1992 First Half Debate Transcript 
 
 October 15, 1992 
 The Second Clinton-Bush-Perot Presidential Debate (First Half of Debate) 
 This is the first half of the transcript of the Richmond debate.
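
The corpii module above is a small helper of my own and isn't shown here. For anyone following along, a minimal sketch of what its load_pres_debates() function could look like, using NLTK's PlaintextCorpusReader, is below (the corpus directory and file pattern are assumptions, not the actual module):

In [ ]:
# Sketch of a corpii-style loader. The corpus root and the .html file pattern
# are assumptions here, not the real module.
from nltk.corpus import PlaintextCorpusReader

def load_pres_debates(root='corpora/pres_debates'):
    # treat every .html transcript under the corpus directory as one corpus;
    # .raw() on the returned reader gives the concatenated text
    return PlaintextCorpusReader(root, r'.*\.html')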

Next, tokenize. Look at a few dozen sentences to see what kinds of tokenization issues you'll have. Write a regular expression tokenizer, using nltk.regexp_tokenize() as seen in class, to do a nice job of breaking your text up into words. You may need to change the regex pattern given in the book to make it work well for your text collection.

Note that this is the key part of the assignment. How you break up the words will have effects down the line for how you can manipulate your text collection. You may want to refine this code later.


In [3]:
# just looking at the first 50 sentences
sents[0:50]


Out[3]:
['October 15, 1992 First Half Debate Transcript \n \n October 15, 1992 \n The Second Clinton-Bush-Perot Presidential Debate (First Half of Debate) \n This is the first half of the transcript of the Richmond debate.',
 'The October 15th "town hall" format debate was moderated by Carole Simpson.',
 'She explains the format in her opening remarks.',
 'The length of this printed transcript is approximately 20 pages.',
 'CAROLE SIMPSON: Good evening and welcome to this second of three presidential debates between the major candidates for president of the US.',
 'The candidates are the Republican nominee, President George Bush, the independent Ross Perot and Governor Bill Clinton, the Democratic nominee.',
 "My name is Carole Simpson, and I will be the moderator for tonight's 90-minute debate, which is coming to you from the campus of the University of Richmond in Richmond, Virginia.",
 "Now, tonight's program is unlike any other presidential debate in history.",
 "We're making history now and it's pretty exciting.",
 'An independent polling firm has selected an audience of 209 uncommitted voters from this area.',
 'The candidates will be asked questions by these voters on a topic of their choosing -- anything they want to ask about.',
 'My job as moderator is to, you know, take care of the questioning, ask questions myself if I think there needs to be continuity and balance, and sometimes I might ask the candidates to respond to what another candidate may have said.',
 'Now, the format has been agreed to by representatives of both the Republican and Democratic campaigns, and there is no subject matter that is restricted.',
 'Anything goes.',
 'We can ask anything.',
 'After the debate, the candidates will have an opportunity to make a closing statement.',
 "So, President Bush, I think you said it earlier -- let's get it on.",
 "PRESIDENT GEORGE BUSH: Let's go.",
 'SIMPSON: And I think the first question is over here.',
 'AUDIENCE QUESTION: Yes.',
 "I'd like to direct my question to Mr. Perot.",
 'What will you do as president to open foreign markets to fair competition from American business and to stop unfair competition here at home from foreign countries so that we can bring jobs back to the US?',
 "ROSS PEROT: That's right at the top of my agenda.",
 "We've shipped millions of jobs overseas and we have a strange situation because we have a process in Washington where after you've served for a while you cash in, become a foreign lobbyist, make $30,000 a month, then take a leave, work on presidential campaigns, make sure you've got good contacts and then go back out.",
 "Now, if you just want to get down to brass tacks, first thing you ought to do is get all these folks who've got these 1-way trade agreements that we've negotiated over the years and say fellas, we'll take the same deal we gave you.",
 "And they'll gridlock right at that point because for example, we've got international competitors who simply could not unload their cars off the ships if they had to comply -- you see, if it was a 2-way street, just couldn't do it.",
 'We have got to stop sending jobs overseas.',
 'To those of you in the audience who are business people: pretty simple.',
 "If you're paying $12, $13, $14 an hour for a factory worker, and you can move your factory south of the border, pay $1 an hour for labor, hire a young -- let's assume you've been in business for a long time.",
 "You've got a mature workforce.",
 "Pay $1 an hour for your labor, have no health care -- that's the most expensive single element in making the car.",
 'Have no environmental controls, no pollution controls and no retirement.',
 "And you don't care about anything but making money.",
 'There will be a job-sucking sound going south.',
 "If the people send me to Washington the first thing I'll do is study that 2000-page agreement and make sure it's a 2-way street.",
 'One last point here.',
 'I decided I was dumb and didn\'t understand it so I called a "Who\'s Who" of the folks that have been around it, and I said why won\'t everybody go south; they said it will be disruptive; I said for how long.',
 "I finally got 'em for 12 to 15 years.",
 'And I said, well, how does it stop being disruptive?',
 "And that is when their jobs come up from a dollar an hour to $6 an hour, and ours go down to $6 an hour; then it's leveled again, but in the meantime you've wrecked the country with these kind of deals.",
 'We got to cut it out.',
 'SIMPSON: Thank you, Mr. Perot.',
 'I see that the president has stood up, so he must have something to say about this.',
 "BUSH: Carole, the thing that saved us in this global economic slowdown has been our exports, and what I'm trying to do is increase our exports.",
 "And if indeed all the jobs were going to move south because there are lower wages, there are lower wages now and they haven't done that.",
 'And so I have just negotiated with the president of Mexico the North American Free Trade Agreement -- and the prime minister of Canada, I might add -- and I want to have more of these free trade agreements, because export jobs are increasing far faster than any jobs that may have moved overseas.',
 "That's a scare tactic, because it's not that many.",
 "But any one that's here, we want to have more jobs here.",
 'And the way to do that is to increase our exports.',
 'Some believe in protection.']
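
Before writing the final pattern, it helps to try the regex straight out of the nltk book on one of these sentences to see where it falls short. The quick check sketched below shows that a speaker tag like "CAROLE SIMPSON:" comes out as three separate tokens, which is why the pattern in the next cell adds a rule for all-caps speaker names ending in a colon.

In [ ]:
# A quick check (sketch): tokenize one sentence with the pattern taken
# directly from the nltk book, before adding anything to it.
book_regex = r'''(?x)
      ([A-Z]\.)+        # abbreviations, e.g. U.S.A.
    | \w+(-\w+)*        # words with optional internal hyphens
    | \$?\d+(\.\d+)?%?  # currency and percentages, e.g. $12.40, 82%
    | \.\.\.            # ellipsis
    | [][.,;"'?():-_`]  # these are separate tokens
'''
# sents[4] starts with "CAROLE SIMPSON:", which this pattern splits into
# 'CAROLE', 'SIMPSON', ':' -- hence the extra speaker-name rule below.
print nltk.regexp_tokenize(sents[4], book_regex)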

In [4]:
token_regex = """(?x)
    [A-Z][A-Z ]+:       # speaker names, e.g. CAROLE SIMPSON:

    # the rest is taken from the nltk book
    | ([A-Z]\.)+        # abbreviations, e.g. U.S.A.
    | \w+(-\w+)*        # words with optional internal hyphens
    | \$?\d+(\.\d+)?%?  # currency and percentages, e.g. $12.40, 82%
    | \.\.\.            # ellipsis
    | [][.,;"'?():-_`]  # these are separate tokens
"""

tokens = nltk.regexp_tokenize(debates, token_regex)
tokens


Out[4]:
['October',
 '15',
 ',',
 '1992',
 'First',
 'Half',
 'Debate',
 'Transcript',
 'October',
 '15',
 ',',
 '1992',
 'The',
 'Second',
 'Clinton-Bush-Perot',
 'Presidential',
 'Debate',
 '(',
 'First',
 'Half',
 'of',
 'Debate',
 ')',
 'This',
 'is',
 'the',
 'first',
 'half',
 'of',
 'the',
 'transcript',
 'of',
 'the',
 'Richmond',
 'debate',
 '.',
 'The',
 'October',
 '15th',
 '"',
 'town',
 'hall',
 '"',
 'format',
 'debate',
 'was',
 'moderated',
 'by',
 'Carole',
 'Simpson',
 '.',
 'She',
 'explains',
 'the',
 'format',
 'in',
 'her',
 'opening',
 'remarks',
 '.',
 'The',
 'length',
 'of',
 'this',
 'printed',
 'transcript',
 'is',
 'approximately',
 '20',
 'pages',
 '.',
 'CAROLE SIMPSON:',
 'Good',
 'evening',
 'and',
 'welcome',
 'to',
 'this',
 'second',
 'of',
 'three',
 'presidential',
 'debates',
 'between',
 'the',
 'major',
 'candidates',
 'for',
 'president',
 'of',
 'the',
 'US',
 '.',
 'The',
 'candidates',
 'are',
 'the',
 'Republican',
 'nominee',
 ',',
 'President',
 'George',
 'Bush',
 ',',
 'the',
 'independent',
 'Ross',
 'Perot',
 'and',
 'Governor',
 'Bill',
 'Clinton',
 ',',
 'the',
 'Democratic',
 'nominee',
 '.',
 'My',
 'name',
 'is',
 'Carole',
 'Simpson',
 ',',
 'and',
 'I',
 'will',
 'be',
 'the',
 'moderator',
 'for',
 'tonight',
 "'",
 's',
 '90-minute',
 'debate',
 ',',
 'which',
 'is',
 'coming',
 'to',
 'you',
 'from',
 'the',
 'campus',
 'of',
 'the',
 'University',
 'of',
 'Richmond',
 'in',
 'Richmond',
 ',',
 'Virginia',
 '.',
 'Now',
 ',',
 'tonight',
 "'",
 's',
 'program',
 'is',
 'unlike',
 'any',
 'other',
 'presidential',
 'debate',
 'in',
 'history',
 '.',
 'We',
 "'",
 're',
 'making',
 'history',
 'now',
 'and',
 'it',
 "'",
 's',
 'pretty',
 'exciting',
 '.',
 'An',
 'independent',
 'polling',
 'firm',
 'has',
 'selected',
 'an',
 'audience',
 'of',
 '209',
 'uncommitted',
 'voters',
 'from',
 'this',
 'area',
 '.',
 'The',
 'candidates',
 'will',
 'be',
 'asked',
 'questions',
 'by',
 'these',
 'voters',
 'on',
 'a',
 'topic',
 'of',
 'their',
 'choosing',
 'anything',
 'they',
 'want',
 'to',
 'ask',
 'about',
 '.',
 'My',
 'job',
 'as',
 'moderator',
 'is',
 'to',
 ',',
 'you',
 'know',
 ',',
 'take',
 'care',
 'of',
 'the',
 'questioning',
 ',',
 'ask',
 'questions',
 'myself',
 'if',
 'I',
 'think',
 'there',
 'needs',
 'to',
 'be',
 'continuity',
 'and',
 'balance',
 ',',
 'and',
 'sometimes',
 'I',
 'might',
 'ask',
 'the',
 'candidates',
 'to',
 'respond',
 'to',
 'what',
 'another',
 'candidate',
 'may',
 'have',
 'said',
 '.',
 'Now',
 ',',
 'the',
 'format',
 'has',
 'been',
 'agreed',
 'to',
 'by',
 'representatives',
 'of',
 'both',
 'the',
 'Republican',
 'and',
 'Democratic',
 'campaigns',
 ',',
 'and',
 'there',
 'is',
 'no',
 'subject',
 'matter',
 'that',
 'is',
 'restricted',
 '.',
 'Anything',
 'goes',
 '.',
 'We',
 'can',
 'ask',
 'anything',
 '.',
 'After',
 'the',
 'debate',
 ',',
 'the',
 'candidates',
 'will',
 'have',
 'an',
 'opportunity',
 'to',
 'make',
 'a',
 'closing',
 'statement',
 '.',
 'So',
 ',',
 'President',
 'Bush',
 ',',
 'I',
 'think',
 'you',
 'said',
 'it',
 'earlier',
 'let',
 "'",
 's',
 'get',
 'it',
 'on',
 '.',
 'PRESIDENT GEORGE BUSH:',
 'Let',
 "'",
 's',
 'go',
 '.',
 'SIMPSON:',
 'And',
 'I',
 'think',
 'the',
 'first',
 'question',
 'is',
 'over',
 'here',
 '.',
 'AUDIENCE QUESTION:',
 'Yes',
 '.',
 'I',
 "'",
 'd',
 'like',
 'to',
 'direct',
 'my',
 'question',
 'to',
 'Mr',
 '.',
 'Perot',
 '.',
 'What',
 'will',
 'you',
 'do',
 'as',
 'president',
 'to',
 'open',
 'foreign',
 'markets',
 'to',
 'fair',
 'competition',
 'from',
 'American',
 'business',
 'and',
 'to',
 'stop',
 'unfair',
 'competition',
 'here',
 'at',
 'home',
 'from',
 'foreign',
 'countries',
 'so',
 'that',
 'we',
 'can',
 'bring',
 'jobs',
 'back',
 'to',
 'the',
 'US',
 '?',
 'ROSS PEROT:',
 'That',
 "'",
 's',
 'right',
 'at',
 'the',
 'top',
 'of',
 'my',
 'agenda',
 '.',
 'We',
 "'",
 've',
 'shipped',
 'millions',
 'of',
 'jobs',
 'overseas',
 'and',
 'we',
 'have',
 'a',
 'strange',
 'situation',
 'because',
 'we',
 'have',
 'a',
 'process',
 'in',
 'Washington',
 'where',
 'after',
 'you',
 "'",
 've',
 'served',
 'for',
 'a',
 'while',
 'you',
 'cash',
 'in',
 ',',
 'become',
 'a',
 'foreign',
 'lobbyist',
 ',',
 'make',
 '$30',
 ',',
 '000',
 'a',
 'month',
 ',',
 'then',
 'take',
 'a',
 'leave',
 ',',
 'work',
 'on',
 'presidential',
 'campaigns',
 ',',
 'make',
 'sure',
 'you',
 "'",
 've',
 'got',
 'good',
 'contacts',
 'and',
 'then',
 'go',
 'back',
 'out',
 '.',
 'Now',
 ',',
 'if',
 'you',
 'just',
 'want',
 'to',
 'get',
 'down',
 'to',
 'brass',
 'tacks',
 ',',
 'first',
 'thing',
 'you',
 'ought',
 'to',
 'do',
 'is',
 'get',
 'all',
 'these',
 'folks',
 'who',
 "'",
 've',
 'got',
 'these',
 '1-way',
 'trade',
 'agreements',
 'that',
 'we',
 "'",
 've',
 'negotiated',
 'over',
 'the',
 'years',
 'and',
 'say',
 'fellas',
 ',',
 'we',
 "'",
 'll',
 'take',
 'the',
 'same',
 'deal',
 'we',
 'gave',
 'you',
 '.',
 'And',
 'they',
 "'",
 'll',
 'gridlock',
 'right',
 'at',
 'that',
 'point',
 'because',
 'for',
 'example',
 ',',
 'we',
 "'",
 've',
 'got',
 'international',
 'competitors',
 'who',
 'simply',
 'could',
 'not',
 'unload',
 'their',
 'cars',
 'off',
 'the',
 'ships',
 'if',
 'they',
 'had',
 'to',
 'comply',
 'you',
 'see',
 ',',
 'if',
 'it',
 'was',
 'a',
 '2-way',
 'street',
 ',',
 'just',
 'couldn',
 "'",
 't',
 'do',
 'it',
 '.',
 'We',
 'have',
 'got',
 'to',
 'stop',
 'sending',
 'jobs',
 'overseas',
 '.',
 'To',
 'those',
 'of',
 'you',
 'in',
 'the',
 'audience',
 'who',
 'are',
 'business',
 'people',
 ':',
 'pretty',
 'simple',
 '.',
 'If',
 'you',
 "'",
 're',
 'paying',
 '$12',
 ',',
 '$13',
 ',',
 '$14',
 'an',
 'hour',
 'for',
 'a',
 'factory',
 'worker',
 ',',
 'and',
 'you',
 'can',
 'move',
 'your',
 'factory',
 'south',
 'of',
 'the',
 'border',
 ',',
 'pay',
 '$1',
 'an',
 'hour',
 'for',
 'labor',
 ',',
 'hire',
 'a',
 'young',
 'let',
 "'",
 's',
 'assume',
 'you',
 "'",
 've',
 'been',
 'in',
 'business',
 'for',
 'a',
 'long',
 'time',
 '.',
 'You',
 "'",
 've',
 'got',
 'a',
 'mature',
 'workforce',
 '.',
 'Pay',
 '$1',
 'an',
 'hour',
 'for',
 'your',
 'labor',
 ',',
 'have',
 'no',
 'health',
 'care',
 'that',
 "'",
 's',
 'the',
 'most',
 'expensive',
 'single',
 'element',
 'in',
 'making',
 'the',
 'car',
 '.',
 'Have',
 'no',
 'environmental',
 'controls',
 ',',
 'no',
 'pollution',
 'controls',
 'and',
 'no',
 'retirement',
 '.',
 'And',
 'you',
 'don',
 "'",
 't',
 'care',
 'about',
 'anything',
 'but',
 'making',
 'money',
 '.',
 'There',
 'will',
 'be',
 'a',
 'job-sucking',
 'sound',
 'going',
 'south',
 '.',
 'If',
 'the',
 'people',
 'send',
 'me',
 'to',
 'Washington',
 'the',
 'first',
 'thing',
 'I',
 "'",
 'll',
 'do',
 'is',
 'study',
 'that',
 '2000-page',
 'agreement',
 'and',
 'make',
 'sure',
 'it',
 "'",
 's',
 'a',
 '2-way',
 'street',
 '.',
 'One',
 'last',
 'point',
 'here',
 '.',
 'I',
 'decided',
 'I',
 'was',
 'dumb',
 'and',
 'didn',
 "'",
 't',
 'understand',
 'it',
 'so',
 'I',
 'called',
 'a',
 '"',
 'Who',
 "'",
 's',
 'Who',
 '"',
 'of',
 'the',
 'folks',
 'that',
 'have',
 'been',
 'around',
 'it',
 ',',
 'and',
 'I',
 'said',
 'why',
 'won',
 "'",
 't',
 'everybody',
 'go',
 'south',
 ';',
 'they',
 'said',
 'it',
 'will',
 'be',
 'disruptive',
 ';',
 'I',
 'said',
 'for',
 'how',
 'long',
 '.',
 'I',
 'finally',
 'got',
 "'",
 'em',
 'for',
 '12',
 'to',
 '15',
 'years',
 '.',
 'And',
 'I',
 'said',
 ',',
 'well',
 ',',
 'how',
 'does',
 'it',
 'stop',
 'being',
 'disruptive',
 '?',
 'And',
 'that',
 'is',
 'when',
 'their',
 'jobs',
 'come',
 'up',
 'from',
 'a',
 'dollar',
 'an',
 'hour',
 'to',
 '$6',
 'an',
 'hour',
 ',',
 'and',
 'ours',
 'go',
 'down',
 'to',
 '$6',
 'an',
 'hour',
 ';',
 'then',
 'it',
 "'",
 's',
 'leveled',
 'again',
 ',',
 'but',
 'in',
 'the',
 'meantime',
 'you',
 "'",
 've',
 'wrecked',
 'the',
 'country',
 'with',
 'these',
 'kind',
 'of',
 'deals',
 '.',
 'We',
 'got',
 'to',
 'cut',
 'it',
 'out',
 '.',
 'SIMPSON:',
 'Thank',
 'you',
 ',',
 'Mr',
 '.',
 'Perot',
 '.',
 'I',
 'see',
 'that',
 'the',
 'president',
 'has',
 'stood',
 'up',
 ',',
 'so',
 'he',
 'must',
 'have',
 'something',
 'to',
 'say',
 'about',
 'this',
 '.',
 'BUSH:',
 'Carole',
 ',',
 'the',
 'thing',
 'that',
 'saved',
 'us',
 'in',
 'this',
 'global',
 'economic',
 'slowdown',
 'has',
 'been',
 'our',
 'exports',
 ',',
 'and',
 'what',
 'I',
 "'",
 'm',
 'trying',
 'to',
 'do',
 'is',
 'increase',
 'our',
 'exports',
 '.',
 'And',
 'if',
 'indeed',
 'all',
 'the',
 'jobs',
 'were',
 'going',
 'to',
 'move',
 'south',
 'because',
 'there',
 'are',
 'lower',
 'wages',
 ',',
 'there',
 'are',
 'lower',
 'wages',
 'now',
 'and',
 'they',
 'haven',
 "'",
 't',
 'done',
 'that',
 '.',
 'And',
 'so',
 'I',
 'have',
 ...]

Compute word counts. Now compute your frequency distribution using a FreqDist over the words. Let's not do lowercasing or stemming yet. You can run this over the whole collection together, or sentence by sentence. Write the code for computing the FreqDist below.


In [5]:
freq = nltk.FreqDist(tokens)
freq.items()[:50]


Out[5]:
[('.', 33808),
 (',', 30447),
 ('the', 27491),
 ('to', 19416),
 ("'", 17270),
 ('that', 13426),
 ('of', 13197),
 ('I', 12865),
 ('and', 12603),
 ('a', 11134),
 ('in', 10341),
 ('we', 8214),
 ('you', 6825),
 ('s', 6588),
 ('is', 6248),
 ('have', 6042),
 ('it', 5968),
 ('for', 5193),
 ('And', 4007),
 ('on', 3749),
 ('this', 3680),
 ('t', 3439),
 ('be', 3319),
 ('are', 3258),
 ('with', 3237),
 ('not', 3213),
 ('our', 2928),
 ('We', 2642),
 ('people', 2537),
 ('do', 2524),
 ('they', 2482),
 ('as', 2434),
 ('was', 2385),
 ('re', 2375),
 ('?', 2339),
 ('what', 2321),
 ('think', 2279),
 ('about', 2228),
 ('ve', 2213),
 ('he', 2209),
 ('going', 2113),
 ('can', 2099),
 ('would', 2090),
 ('will', 2045),
 ('has', 1913),
 ('there', 1890),
 ('The', 1647),
 ('all', 1643),
 ('President', 1640),
 ('at', 1613)]

Creating a table. Python provides an easy way to line columns up in a table. You can specify a width for a string, such as %6s, producing a string that is padded to width 6. It is right-justified by default, but a minus sign switches it to left-justified, so %-3d means left-justify an integer with width 3. And if you don't know the width in advance, you can make it a variable by using an asterisk rather than a number, as in %*s or %-*d, passing the width as an extra argument. Check out this example (this is just fyi):


In [6]:
print '%-16s' % 'Info type', '%-16s' % 'Value'
print '%-16s' % 'number of words', '%-16d' % 100000


Info type        Value           
number of words  100000          
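
And the variable-width form mentioned above, where the asterisk pulls the width from an extra argument (also just fyi):

In [ ]:
# same two rows, but with the column width supplied as an argument via '*'
width = 16
print '%-*s' % (width, 'Info type'), '%-*s' % (width, 'Value')
print '%-*s' % (width, 'number of words'), '%-*d' % (width, 100000)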

Word Properties Table Next there is a table of word properties, which you should compute (skip unique word stems, since we will do stemming in class on Wed). Make a table that prints out:

  1. number of words
  2. number of unique words
  3. average word length
  4. longest word

You can make your table look prettier than the example I showed above if you like!

You can decide for yourself if you want to eliminate punctuation and function words (stop words) or not. It's your collection!
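
For reference, if you did want to drop punctuation and stop words before building the table, one way is sketched below (it assumes the NLTK stopwords corpus has been downloaded via nltk.download()). The table that follows keeps everything.

In [ ]:
# Optional sketch: filter out stop words and pure-punctuation tokens.
# Assumes the 'stopwords' corpus has already been fetched with nltk.download().
from nltk.corpus import stopwords

stops = set(stopwords.words('english'))
content_tokens = [w for w in tokens
                  if w.lower() not in stops      # drop function words
                  and re.search(r'\w', w)]       # drop pure punctuation
print '%d of %d tokens remain after filtering' % (len(content_tokens), len(tokens))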


In [7]:
word_count = len(tokens)
unique_count = len(set(tokens))
avg_length = sum(len(w) for w in tokens)/float(word_count)

# get longest word
longest = ''
for w in tokens:
    if len(w) >= len(longest):
        longest = w

print "{:<16}|{:<16}|{:<16}|{:<16}".format("# Words", "Unique Words", "Avg Length", "Longest Word")
print "{:<16,d}|{:<16,d}|{:<16.3f}|{:<16}".format(word_count, unique_count, avg_length, longest)


# Words         |Unique Words    |Avg Length      |Longest Word    
663,224         |16,143          |3.860           |LEAGUE OF WOMEN VOTERS EDUCATION FUND:

Most Frequent Words List. Next is the most frequent words list. This table shows each of the most frequent words along with its percent of the total, so compute that number as well.


In [8]:
print "{:<16}|{:<16}|{:<16}".format("Word", "Count", "Frequency")

for i in freq.items()[0:50]:
    print "{:<16}|{:<16,d}|{:<16.3%}".format(i[0], i[1], freq.freq(i[0]))


Word            |Count           |Frequency       
.               |33,808          |5.098%          
,               |30,447          |4.591%          
the             |27,491          |4.145%          
to              |19,416          |2.928%          
'               |17,270          |2.604%          
that            |13,426          |2.024%          
of              |13,197          |1.990%          
I               |12,865          |1.940%          
and             |12,603          |1.900%          
a               |11,134          |1.679%          
in              |10,341          |1.559%          
we              |8,214           |1.238%          
you             |6,825           |1.029%          
s               |6,588           |0.993%          
is              |6,248           |0.942%          
have            |6,042           |0.911%          
it              |5,968           |0.900%          
for             |5,193           |0.783%          
And             |4,007           |0.604%          
on              |3,749           |0.565%          
this            |3,680           |0.555%          
t               |3,439           |0.519%          
be              |3,319           |0.500%          
are             |3,258           |0.491%          
with            |3,237           |0.488%          
not             |3,213           |0.484%          
our             |2,928           |0.441%          
We              |2,642           |0.398%          
people          |2,537           |0.383%          
do              |2,524           |0.381%          
they            |2,482           |0.374%          
as              |2,434           |0.367%          
was             |2,385           |0.360%          
re              |2,375           |0.358%          
?               |2,339           |0.353%          
what            |2,321           |0.350%          
think           |2,279           |0.344%          
about           |2,228           |0.336%          
ve              |2,213           |0.334%          
he              |2,209           |0.333%          
going           |2,113           |0.319%          
can             |2,099           |0.316%          
would           |2,090           |0.315%          
will            |2,045           |0.308%          
has             |1,913           |0.288%          
there           |1,890           |0.285%          
The             |1,647           |0.248%          
all             |1,643           |0.248%          
President       |1,640           |0.247%          
at              |1,613           |0.243%          

Most Frequent Capitalized Words List We haven't lowercased the text, so you should be able to compute this. Don't worry about whether the capitalization comes from proper nouns, sentence starts, or elsewhere. You need to make a different FreqDist for this one. Write the code for the new FreqDist here and show the resulting list.


In [9]:
cap_freq = nltk.FreqDist([w for w in tokens if re.match(r"[A-Z]", w)])

print "{:<16}|{:<16}|{:<16}".format("Word", "Count", "Frequency")

for i in cap_freq.items()[0:50]:
    print "{:<16}|{:<16,d}|{:<16.3%}".format(i[0], i[1], cap_freq.freq(i[0]))


Word            |Count           |Frequency       
I               |12,865          |15.565%         
And             |4,007           |4.848%          
We              |2,642           |3.196%          
The             |1,647           |1.993%          
President       |1,640           |1.984%          
But             |1,501           |1.816%          
Mr              |1,408           |1.703%          
It              |1,397           |1.690%          
Senator         |1,211           |1.465%          
That            |1,112           |1.345%          
America         |1,025           |1.240%          
You             |959             |1.160%          
American        |904             |1.094%          
Now             |838             |1.014%          
Governor        |834             |1.009%          
United          |812             |0.982%          
States          |763             |0.923%          
MR              |705             |0.853%          
They            |694             |0.840%          
He              |682             |0.825%          
Well            |662             |0.801%          
So              |627             |0.759%          
Vice            |494             |0.598%          
BUSH:           |476             |0.576%          
LEHRER:         |476             |0.576%          
This            |473             |0.572%          
If              |472             |0.571%          
Congress        |465             |0.563%          
There           |462             |0.559%          
Bush            |456             |0.552%          
What            |437             |0.529%          
Let             |420             |0.508%          
In              |419             |0.507%          
Americans       |367             |0.444%          
OBAMA:          |357             |0.432%          
Security        |340             |0.411%          
Medicare        |328             |0.397%          
Social          |320             |0.387%          
John            |312             |0.377%          
Iraq            |309             |0.374%          
McCain          |304             |0.368%          
GORE:           |293             |0.354%          
MODERATOR:      |283             |0.342%          
Obama           |280             |0.339%          
Thank           |266             |0.322%          
Kennedy         |248             |0.300%          
No              |248             |0.300%          
Senate          |245             |0.296%          
Clinton         |243             |0.294%          
When            |240             |0.290%          

Sentence Properties Table This summarizes the number of sentences and the average sentence length in words and in characters (you decide whether to include stop words/punctuation or not). Print those out in a table here.


In [10]:
sent_count = len(sents)
avg_char = sum(len(s) for s in sents)/float(sent_count)
avg_word = sum([len(nltk.regexp_tokenize(s, token_regex)) for s in sents])/float(sent_count)

print "{:<16}|{:<16}|{:<16}".format("# Sents", "Avg Words", "Avg Char")
print "{:<16,d}|{:<16.3f}|{:<16.3f}".format(sent_count, avg_word,  avg_char)


# Sents         |Avg Words       |Avg Char        
33,694          |19.684          |92.063